
Updating cpu-manager-policy=static causes NVML unknown error #966

Closed
henglianghe opened this issue Apr 25, 2019 · 12 comments

Comments

@henglianghe

What happened:

  1. After setting cpu-manager-policy=static in Kubernetes, running nvidia-smi in a pod with a GPU reports an error:

    Failed to initialize NVML: Unknown Error
    
  2. Setting cpu-manager-policy=none does not cause this error.

  3. Sometimes nvidia-smi works when the pod first starts, and about 10 seconds later running nvidia-smi starts failing.

  4. Checking the cause of the error shows that opening /dev/nvidiactl fails with Operation not permitted:

    strace -v -a 100 -s 1000 nvidia-smi
    close(3)                                                                                           = 0
    open("/dev/nvidiactl", O_RDWR)                                                                     = -1 EPERM (Operation not permitted)
    open("/dev/nvidiactl", O_RDONLY)                                                                   = -1 EPERM (Operation not permitted)
    fstat(1, {st_dev=makedev(0, 704), st_ino=4, st_mode=S_IFCHR|0620, st_nlink=1, st_uid=0, st_gid=5, st_blksize=1024, st_blocks=0, st_rdev=makedev(136, 1), st_atime=2019/04/23-17:35:28.678347231, st_mtime=2019/04/23-17:35:28.678347231, st_ctime=2019/04/23-17:33:09.682347235}) = 0
    write(1, "Failed to initialize NVML: Unknown Error\n", 41Failed to initialize NVML: Unknown Error
    )                                         = 41
    exit_group(255)                                                                                     = ?
    +++ exited with 255 +++
    
  5. Setting cpu-manager-policy to none and to static respectively and creating a pod under each, we get test-gpu (nvidia-smi runs successfully) and test-gpu-err (running nvidia-smi reports an error).

    1. Comparing the two pods' devices.list files under /sys/fs/cgroup/devices shows that the entries differ:

    2. test-gpu (nvidia-smi runs successfully):

      root@super8:/sys/fs/cgroup/devices/kubepods/besteffort/pod52c61ec9-65b5-11e9-8cd2-0cc47aea540c/caca989a1f8d1a8c87f67c04d2d63347a98f52d745c44e77895b3ca4dfd9b18f# cat devices.list 
      c 1:5 rwm
      c 1:3 rwm
      c 1:9 rwm
      c 1:8 rwm
      c 5:0 rwm
      c 5:1 rwm
      c *:* m
      b *:* m
      c 1:7 rwm
      c 136:* rwm
      c 5:2 rwm
      c 10:200 rwm
      c 195:255 rw
      c 195:3 rw
      
    3. test-gpu-err (running nvidia-smi reports an error):

      root@super8:/sys/fs/cgroup/devices/kubepods/besteffort/podbfa294b1-65aa-11e9-8cd2-0cc47aea540c/771eb2c6d41fe48160000ad481702d09bdda5bfe49d613f96273412e177b449d# cat devices.list 
      c 1:5 rwm
      c 1:3 rwm
      c 1:9 rwm
      c 1:8 rwm
      c 5:0 rwm
      c 5:1 rwm
      c *:* m
      b *:* m
      c 1:7 rwm
      c 136:* rwm
      c 5:2 rwm
      c 10:200 rwm
      
  6. So, after setting cpu-manager-policy=static in Kubernetes, a GPU pod can run nvidia-smi for a short time, but something that runs roughly every 10 seconds rewrites /sys/fs/cgroup/devices/.../devices.list, removing read/write access to /dev/nvidiactl (and presumably the other NVIDIA device nodes), which then causes the nvidia-smi error.
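
A rough way to watch this happening from the node is to poll the pod's devices.list and check whether the c 195:* entries (the NVIDIA character-device major seen above) are still present. This is only a diagnostic sketch, not part of any fix; the devices.list path is whatever cgroup path your container actually has:

    package main

    import (
        "fmt"
        "os"
        "strings"
        "time"
    )

    func main() {
        if len(os.Args) < 2 {
            fmt.Println("usage: watch-devices <path-to-devices.list>")
            os.Exit(1)
        }
        // Example path (substitute your own pod UID and container ID):
        // /sys/fs/cgroup/devices/kubepods/besteffort/pod<uid>/<container-id>/devices.list
        path := os.Args[1]

        for {
            data, err := os.ReadFile(path)
            switch {
            case err != nil:
                fmt.Println(time.Now().Format(time.RFC3339), "read error:", err)
            case strings.Contains(string(data), "c 195:"):
                // Major number 195 is the NVIDIA character-device major.
                fmt.Println(time.Now().Format(time.RFC3339), "nvidia entries present")
            default:
                fmt.Println(time.Now().Format(time.RFC3339), "nvidia entries gone")
            }
            time.Sleep(2 * time.Second)
        }
    }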

@klueska
Contributor

klueska commented Apr 25, 2019

Unfortunately, this is a known issue. It was first reported here:
NVIDIA/nvidia-container-toolkit#138

The underlying issue is that libnvidia-container injects some devices and modifies some cgroups out-of-band of the container engine it is operating on behalf of when setting a container up for use with GPUs. This causes the internal state of the container engine to be out of sync with what has actually been set up for the container.

For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the nvidia devices injected into it and the cgroup access to those devices is set up properly.

This has not been an issue until now because everything works fine at initial container creation time: libnvidia-container modifies these settings only after docker has already set the container up, and normally no further cgroup updates happen afterwards.

The problem comes when some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call like the CPUManager in Kubernetes does). When this API is invoked, docker resolves its empty device list to disk, essentially "undoing" what libnvidia-container had set up in regards to these devices.

We need to come up with a solution that allows libnvidia-container to take control of managing these devices on behalf of docker (or any container engine) while properly informing it so that it can keep its internal state in sync.
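
A quick way to see this desynchronization is to look at a working GPU container's HostConfig.Devices, which comes back empty even though the NVIDIA device nodes are usable inside the container. Here is a minimal sketch using the Docker Engine Go SDK (the container name test-gpu is just the example pod from above; plain docker inspect on the CLI shows the same thing):

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/client"
)

func main() {
	// Connect to the local Docker daemon using the usual environment defaults.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("failed to create docker client: %v", err)
	}
	defer cli.Close()

	// "test-gpu" can be any GPU-enabled container name or ID.
	inspect, err := cli.ContainerInspect(context.Background(), "test-gpu")
	if err != nil {
		log.Fatalf("failed to inspect container: %v", err)
	}

	// For a container set up by libnvidia-container this typically prints an
	// empty list, even though /dev/nvidia* has been injected into the container.
	fmt.Printf("HostConfig.Devices: %+v\n", inspect.HostConfig.Devices)
}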

@klueska
Contributor

klueska commented Apr 25, 2019

If your setup is constrained such that GPUs will only ever be used by containers that have CPUSets assigned to them via the static allocation policy, then the following patch to Kubernetes will avoid having docker update its cgroups after these containers are initially launched. This is not a generic enough solution to work in all cases though, unfortunately.

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 4ccddd5..ff3fbdf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -242,7 +242,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                        // - policy does not want to track the container
                        // - kubelet has just been restarted - and there is no previous state file
                        // - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
-                       if _, ok := m.state.GetCPUSet(containerID); !ok {
+                       cset, ok := m.state.GetCPUSet(containerID)
+                       if !ok {
                                if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
                                        klog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
                                        err := m.AddContainer(pod, &container, containerID)
@@ -258,7 +259,13 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                                }
                        }

-                       cset := m.state.GetCPUSetOrDefault(containerID)
+                       if !cset.IsEmpty() && m.policy.Name() == string(PolicyStatic) {
+                               klog.V(4).Infof("[cpumanager] reconcileState: skipping container; assigned cpuset unchanged (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
+                               success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
+                               continue
+                       }
+
+                       cset = m.state.GetDefaultCPUSet()
                        if cset.IsEmpty() {
                                // NOTE: This should not happen outside of tests.
                                klog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)

@henglianghe
Author

henglianghe commented Apr 26, 2019

@klueska Thanks so much; it works when we apply your patch, set the CPU manager policy to static, and run the container with "guaranteed" QoS.

However, is there any way to get GPUs working for containers that are not "guaranteed" while the CPU manager policy is set to static? Is that something you intend to develop, or is there a workaround I could apply myself?

@henglianghe
Author

henglianghe commented Apr 26, 2019

@klueska As we know, LXD 3.0.0 already supports NVIDIA runtime passthrough.
https://discuss.linuxcontainers.org/t/lxd-3-0-0-has-been-released/1491
If Kubernetes could create LXD 3.0.0 containers instead of docker containers, could we bypass the current problem, since LXD is not docker and may not require libnvidia-container?

@klueska
Contributor

klueska commented Apr 26, 2019

Any change made to kubernetes is always going to be a workaround. The real fix needs to come in libnvidia-container or docker or some combination of both.

I've never tried LXD with Kubernetes, so I'm not in a position to say how well it would work or not. I do know that LXD still uses libnvidia-container under the hood though, so it may exhibit the same problems.

Again, the underlying problem is that docker is not told about the devices that libnvidia-container injects into it, so if you come up with a workaround that updates docker's internal state with this information, that should be sufficient.

RenaudWasTaken pushed a commit to NVIDIA/k8s-device-plugin that referenced this issue Apr 6, 2020
It is not strictly necessary to return the list of device nodes in order
to trigger the NVIDIA container stack to inject a set of GPUs into a
container. However, without this list, the container runtime (i.e.
docker) will not be aware of the device nodes injected into the
container by the NVIDIA container stack.

This normally has little to no consequence for containers relying on the
NVIDIA container stack to allocate GPUs. However, problems arise
when using GPUs in conjunction with the kubernetes CPUManager. The
following issue summarizes the problem well:

NVIDIA/nvidia-docker#966

With this patch, we add a flag to optionally pass back the list of
device nodes that the NVIDIA container stack will inject into the
container so that the kubelet can forward it to the container runtime.
With this small change, the above issue no longer gets triggered.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
@RenaudWasTaken
Contributor

@klueska fixed this in the latest NVIDIA device plugin (beta5) release, though it is a workaround.

@klueska
Contributor

klueska commented Apr 8, 2020

Note, to use this workaround you will need to use the new daemonset spec nvidia-device-plugin-compat-with-cpumanager.yml instead of the default one.

This spec does two things different from the default one:

  1. It passes a new argument, --pass-device-specs, to the plugin executable
  2. It launches the plugin as --privileged

If you don't want to use the --privileged flag, then things will still "work" in terms of allowing pods with GPUs to run, but you will see the plugin restart anytime a container with guaranteed CPUs from the CPUManager starts. If you are OK with this restart, then launching the daemonset as --privileged is not strictly necessary.
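
For reference, what --pass-device-specs changes is that the plugin also returns the NVIDIA device nodes in its Allocate response, so the kubelet can forward them to the container runtime and docker's internal device list stays in sync. Below is a rough sketch of such a response, assuming the v1beta1 device plugin API from k8s.io/kubelet; the device paths are illustrative, not the exact list the plugin computes:

package main

import (
	"fmt"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildContainerResponse sketches what a device plugin hands back to the
// kubelet for one container when it also reports device specs. With the
// devices recorded in docker's state, a later ContainerUpdate (e.g. from the
// CPUManager) no longer wipes the device cgroup permissions.
func buildContainerResponse(deviceIDs []string) *pluginapi.ContainerAllocateResponse {
	resp := &pluginapi.ContainerAllocateResponse{
		// The NVIDIA container stack still performs the actual injection;
		// this env var tells it which GPUs to expose.
		Envs: map[string]string{"NVIDIA_VISIBLE_DEVICES": strings.Join(deviceIDs, ",")},
	}
	// Illustrative device nodes only; the real plugin derives the list for
	// the requested GPUs plus the shared control devices.
	for _, p := range []string{"/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"} {
		resp.Devices = append(resp.Devices, &pluginapi.DeviceSpec{
			HostPath:      p,
			ContainerPath: p,
			Permissions:   "rw",
		})
	}
	return resp
}

func main() {
	fmt.Printf("%+v\n", buildContainerResponse([]string{"GPU-0"}))
}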

paroque28 pushed a commit to paroque28/k8s-device-plugin that referenced this issue May 14, 2020
@klueska klueska closed this as completed May 22, 2020
@zionwu

zionwu commented Nov 26, 2020

@klueska Is there a device plugin PR for this issue? I would like to learn about the fix.

@klueska
Contributor

klueska commented Nov 26, 2020

@zionwu

zionwu commented Dec 7, 2020

@klueska nvidia-device-plugin-compat-with-cpumanager.yml uses nvidia/k8s-device-plugin:v0.7.1. My current cluster is using nvidia/k8s-device-plugin:1.11. Is k8s-device-plugin:v0.7.1 a newer version than k8s-device-plugin:1.11? Can I upgrade the daemonset directly without breaking things in production?

@klueska
Contributor

klueska commented Dec 7, 2020

@zionwu, Yes v0.7.1 is newer than 1.11 and you should be able to upgrade without any issues.

Please see https://github.com/NVIDIA/k8s-device-plugin#versioning for info on versioning / upgrading.

Also, keep in mind that the semantics around deploying the plugin on nodes that do not have GPUs have changed slightly. You may need to set this flag to false in your daemonset if you rely on being able to deploy it on non-GPU nodes without erroring out:
NVIDIA/k8s-device-plugin@2a9b835

@zionwu

zionwu commented Dec 7, 2020

Got it. Thank you, @klueska!
